
Conversation

@vasqu (Contributor) commented Oct 24, 2025

Non-vmap creation of masks. This works with all our base masks; we only fall back to vmap when using patterns we cannot guarantee (i.e. additional and/or-combined mask functions). A minimal sketch of the broadcasting idea follows the notes below.

Note:

  • Non-vmap works with every mask that is purely index-based
  • Merged the old/new SDPA mask creation under one function --> easier maintenance imo
  • ExecuTorch does not need an additional masking function anymore
  • Lifts some restrictions on older torch versions, e.g. chunked attention with padding, packed attention masks, etc.
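
For illustration, a minimal sketch of the broadcasting idea (hypothetical and simplified; the mask_function(batch_idx, head_idx, q_idx, kv_idx) signature follows the usual convention, but this is not the exact implementation):

import torch

def causal_mask_function(batch_idx, head_idx, q_idx, kv_idx):
    # Purely index-based: a comparison between broadcastable index tensors
    return kv_idx <= q_idx

# Instead of vmapping over index vectors, pass broadcastable index grids directly
q_idx = torch.arange(4).view(-1, 1)   # shape (q_len, 1)
kv_idx = torch.arange(6).view(1, -1)  # shape (1, kv_len)
mask = causal_mask_function(None, None, q_idx, kv_idx)  # bool mask of shape (q_len, kv_len)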

Fixes #41639 (MarianMTModel performance regression due to Bidirectional masks)

cc @jiqing-feng @IlyasMoutawwakil

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu vasqu changed the title [WIP][Masking] Non-vmap default for attention masks [Attn Masks] Non-vmap default for attention masks Oct 29, 2025
@vasqu vasqu marked this pull request as ready for review October 29, 2025 11:06
return cache


def sdpa_mask_without_vmap(
@vasqu (Contributor, Author):

No longer needed, as vmap was the reason we needed this workaround in the first place.

NOTE: It is important to keep an index-based version for non-vmap expansion.
"""
- return q_idx.new_ones((), dtype=torch.bool)
+ return q_idx >= 0
@vasqu (Contributor, Author):

As noted above, for non-vmap we need this as an index-based version.
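
A quick illustration of the difference (hypothetical snippet): the scalar form carries no index shape, while the comparison follows the index grid, so it survives the broadcasting expansion:

import torch

q_idx = torch.arange(4).view(-1, 1)  # index grid of shape (q_len, 1)

q_idx.new_ones((), dtype=torch.bool).shape  # torch.Size([]): a lone scalar True
(q_idx >= 0).shape                          # torch.Size([4, 1]): tracks the index shape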

causal_mask |= torch.all(~causal_mask, dim=-1, keepdim=True)
return causal_mask

attention_mask = attention_mask | torch.all(~attention_mask, dim=-1, keepdim=True)
@vasqu (Contributor, Author):

I encountered issues with the in-place version, where we'd need a clone (e.g. when using SWA). This is safer.
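
A minimal illustration of the hazard (hypothetical, assuming the mask can be an expanded broadcast view, as with sliding-window attention):

import torch

base = torch.zeros(1, 4, dtype=torch.bool)
mask = base.expand(3, 4)  # broadcast view, no copy: all rows alias the same storage

# An in-place OR (mask |= ...) on this view raises a RuntimeError because
# multiple elements of the written-to tensor share one memory location,
# so we'd have to clone first. The out-of-place version allocates a fresh tensor:
mask = mask | torch.all(~mask, dim=-1, keepdim=True)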

@ArthurZucker (Collaborator) left a comment:

Can we add a test to the default tests to check that there is no graph break on this?

@jiqing-feng (Contributor) commented:

Hi @vasqu. Is anything blocking the merge?

@Cyrilvallez (Member) left a comment:

Very very nice, this solves quite a lot of different issues at the same time! I'm very happy to avoid special handling for export! Very clever use of broadcasting from the optimum team; I did not know we could simply do such things! Thanks a lot for upstreaming directly to us.

Do you mind expanding a bit more, for posterity, on what the limitations of the broadcasting approach are? Is it only index-based operations, as you mention in the comments, or are there more subtle things?

@IlyasMoutawwakil (Member) commented Nov 10, 2025

> Do you mind expanding a bit more, for posterity, on what the limitations of the broadcasting approach are? Is it only index-based operations, as you mention in the comments, or are there more subtle things?

Nothing I'm aware of; the only condition is to write the mask_function as a comparison between the indexes (and constants).
One example of this is bidirectional_mask_function, which was written as q_idx.new_ones((), dtype=torch.bool), i.e. it simply returns a scalar 1 (true); @vasqu rewrote it as q_idx >= 0 (always true). I think any mask function can be written as f(b, h, q, kv), but I can't prove it 😂

@vasqu (Contributor, Author) commented Nov 10, 2025

Merging this then! Let's see what crazy masks come up in the future; for now the "mask hypothesis" holds 😆

@vasqu vasqu merged commit 03538a8 into huggingface:main Nov 10, 2025
23 checks passed
@vasqu vasqu deleted the non-vmap-masks branch November 10, 2025 15:04
@vasqu (Contributor, Author) commented Nov 10, 2025

@jiqing-feng it was mainly blocked by me being out last week ;)
